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Abstract 

The GC content in the third codon position (GC 3 ) exhibits a unimodal distribution in many plant and animal genomes. Interestingly, 
grasses and homeotherm vertebrates exhibit a unique bimodal distribution. High GC 3 was previously found to be associated with 
variable expression, higher frequency of upstream TATA boxes, and an increase of GC 3 from 5 r to 3' '. Moreover, GC 3 -rich genes are 
predominant in certain gene classes and are enriched in CpG dinucleotides that are potential targets for methylation. Based on the 
GC 3 bimodal distribution we hypothesize that GC 3 has a regulatory role involving methylation and gene expression. To test that 
hypothesis, we selected diverse taxa (rice, thale cress, bee, and human) that varied in the modality of their GC 3 distribution and tested 
the association between GC 3 , DNA methylation, and gene expression. We examine the relationship between cytosine methylation 
levels and GC 3 , gene expression, genome signature, gene length, and other gene compositional features. We find a strong negative 
correlation (Pearson's correlation coefficient r= -0.67, Pvalue < 0.0001 ) between GC 3 and genie CpG methylation. The comparison 
between 5'-3' gradients of CG 3 -skew and genie methylation for the taxa in the study suggests interplay between gene-body 
methylation and transcription-coupled cytosine deamination effect. Compositional features are correlated with methylation levels 
of genes in rice, thale cress, human, bee, and fruit fly (which acts as an unmethylated control). These patterns allow us to generate 
evolutionary hypotheses about the relationships between GC 3 and methylation and how these affect expression patterns. 
Specifically, we propose that the opposite effects of methylation and compositional gradients along coding regions of GC 3 -poor 
and GC 3 -rich genes are the products of several competing processes. 

Key words: DNA methylation, gene expression, GC 3 , grasses, homeotherms, Oryza sativa, Apis mellifera, Homo sapiens, 
Arabidopsis tha liana. 



Introduction 

The term epigenetics was coined in 1957 by Conrad Hal 
Waddington (Slack 2002). It is defined as the study of changes 
in gene expression due to mechanisms other than alterations 
to the DNA sequence; that is expression modifications are not 
hard coded into the nucleotide sequence. Consequently, 
epigenetics explains phenomena, which do not result from 
standard genetic mutations, like hereditary changes in gene 
expression under the influence of environmental factors. DNA 
methylation is one of the most studied epigenetic mechanisms 
modulating gene expression and has important health impli- 
cations. For example, the gain or loss of DNA methylation can 
produce loss of genomic imprinting and results in diseases 



such as Beckwith-Wiedemann, Prader-Willi, and Angelman 
syndromes (Adams 2008). Changes in the patterns of DNA 
methylation are also commonly seen in human tumors. Both 
genome wide hypomethylation (insufficient methylation) and 
region-specific hypermethylation (excessive methylation) have 
been thought to play a role in carcinogenesis (Lengauer 2007). 
DNA hypomethylation contributes to cancer development 
through an increase in genomic instability, reactivation of 
transposable elements, and loss of imprinting (Esteller 2002). 
Hypermethylation-induced silencing of primary transcripts 
through their CpG island promoters is a common cause 
of the loss of tumor suppressor miRNAs in cancer 
(Lengauer 2007; Lopez-Serra and Esteller 2012; Sonkin et al. 
2013). 
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Methylation occurs by the addition of a chemical methyl 
group (-CH 3 ) through a covalent bond to the cytosine bases 
of the DNA backbone and tends to be more abundant at 
Cytosine-phosphate-Guanine (CpG) dinucleotides (Sadikovic 
2008). However, methylation can also happen in CHG and 
CHH contexts (where H indicates any nucleotide other than 
G). DNA methylation is common in humans and other mam- 
mals, where 70-80% of CpG dinucleotides are methylated. 
Interestingly, in some model organisms, such as yeast and fruit 
fly, there is little or no DNA methylation. Also, DNA methyla- 
tion in mammals differs from that in plants as it targets CpG 
sites. In humans and mice, CpG dinucleotides account for 
roughly three quarters of the total DNA methylation content 
in their cells (Ziller et al. 2011). 

In vertebrates, the methylation process is being catalyzed 
by members of the enzyme family of DNA methyltransferases 
(DNMTs), which recognize palindromic sequences with CpG 
dinucleotides. Thus far, three active DNMTs have been iden- 
tified in mammals: DNMT1 , DNMT3A, and DNMT3B. A fourth 
similar enzyme (DNMT2 or TRDMT1) is structurally similar to 
the other DMNTs. However, it does not methylate DNA but 
rather transfers RNA (Goll et al. 2006). DNA methylation of 
CpG dinucleotides is essential for plant and mammalian de- 
velopment. Methylation mediates the expression of genes and 
plays a key role in chromosome X inactivation, genomic im- 
printing, embryonic development, chromosome stability, 
chromatin structure, and may also be involved in the immo- 
bilization of transposons and the control of tissue-specific 
gene expression (Li et al. 2008). 

The relationship between gene expression, nucleotide com- 
position, and gene length were the subject of several studies 
in the past decades. Oliver and Marin (1996) associated the 
expected length of a reading frame to the CG composition 
using the property that stop codons (TAG, TAA, and TGA) are 
biased toward low GC content. They suggested that the lon- 
gest coding sequences/exons in vertebrates are GC rich, while 
the shortest ones are GC-poor. Subsequently, Xia et al. (2003) 
described positive correlations between GC content and 
coding regions (CDS) lengths in 68 genomes. It was later 
shown that highly expressed rice and human GC-rich genes 
have significantly more and longer introns than lowly ex- 
pressed genes, whereas their average exon length per gene 
is significantly lower. By contrast, GC-poor genes were shown 
to exhibit similar compactness between highly and lowly 
expressed genes (Mukhopadhyay and Ghosh 2010). 

The relationship between gene-body methylation and gene 
expression was studied in a number of organisms, and a pos- 
itive linear correlation was reported (Xiang et al 201 0; Zemach 
et al. 2010). Anastasiadou et al. (201 1) reported the relation- 
ship between splicing and methylation in the human genome 
as well as a positive relationship between alternative splicing 
and methylation. Recently, Flores et al. (2012) reported a pos- 
itive relationship between exon-level DNA methylation and 
mRNA expression in the honeybee. They also found that 



methylated genes are enriched for alternative splicing; there- 
fore suggesting that gene-body DNA methylation positively 
influences exon inclusion during transcription. The authors 
proposed that DNA methylation and alternative splicing con- 
tribute to a longer gene length and a slower rate of gene 
evolution. However, none of these studies considered the 
potential regulatory role of GC 3 . 

Several studies focused on coding regions that are enriched 
in methylation targets (CpG-rich). For example, Nanty et al. 
(2011) found an evolutionarily conserved feature in inverte- 
brate genomes separating CpG-poor and CpG-rich genes: 
CpG-poor genes were associated with basic biological pro- 
cesses, while the latter with more specialized functions. 
Gavery and Roberts (2010) found that hypo- and hypermeth- 
ylated genes differ in both biological function and in the ratio 
between observed and expected CpG dinucleotides. Coding 
regions enriched in CpG dinucleotides also exhibit a higher 
frequency of G or C in the third codon position (GC 3 ). 
Because mutations in this position lead primarily to synony- 
mous substitutions, the selective pressures affecting its com- 
position are different from those acting on the first two codon 
positions, making it a valuable tool to study evolution. To 
name a few, it has been previously shown (Tatarinova et al. 
201 0; Sablok et al. 201 1 ; Ahmad et al. 201 3) that dicot and 
monocot plant genes with high GC 3 have distinctly different 
properties from genes with low GC 3 : they contain more tar- 
gets for methylated GC 3 -rich genes, and also exhibit more 
variable expression, possess more upstream TATA boxes, are 
enriched for certain classes of genes (e.g., stress responsive 
genes), and have a GC 3 content that increases from 5' to 3 r 
(Tatarinova et al. 2010). GC 3 -rich genes were also shown to 
be inducible while the GC 3 -poor are ubiquitously active 
(Tatarinova et al. 2010). Thus we speculate that GC 3 has 
evolved to be interdependent with gene-body methylation 
and gene expression so that genes that are GC 3 -rich or 
-poor have different expression patterns. 

Here, we tested the hypothesis of the regulatory role of 
GC 3 by studying the relationship between GC 3 , gene-body 
methylation, and related genomic features in four taxa: rice, 
arabidopsis, bee, and human. These particular species were 
chosen because they have well-annotated genomes, rich col- 
lections of gene expression measurements, and genome-wide 
methylation measurements. Comparison with the fruit fly 
allows us to separate methylation-related effects from other 
factors. We show that GC 3 is inversely correlated with 
gene methylation in these four organisms and propose an 
evolutionary theory to explain these patterns. 

Materials and Methods 

Gene models were taken from MSU (version 6.1) for Oryza 
sativa; TAIR version 7 for Arabidopsis thaiiana; BeeBase (www. 
beebase.org) annotation for Apis mellifera; NCBI GenBank for 
Homo sapiens (hg18); and dmel_hetr31 from FlyBase (www. 
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flybase.org) as well as Release 5 from Berkeley Drosophila 
Genome Project (www.fruitfy.org) for Drosophila 
melanogaster. 

Gene expression data were obtained from the NCBI 
GEO collection (GSE9415, GSE24177, GSE5624, GSE1647, 
GSE1 9700, GSE9646-GPL1 0978, GSE9646-GP1 0977, 
GSE16474, GSE34029, GSE34293, GSM846863, GSE25161, 
GSE34029, GSE34293, GSE42255, GSE5147, GSE1643, 
GSE7567, GSE16144, GSE21009-GPL10237). 

Filtering 

We selected gene sets where gene expression, methylation, 
and high-quality annotation data were available: there were 
12,577 such genes in A thaliana, 14,069 in H. sapiens, 9,607 
genes in O. sativa, and 15,381 genes in Api. mellifera. For 
Drosophila melanogaster we used 18,731 coding sequences. 

Methylation bisulfite sequencing measurements for the 
four organisms were obtained from previously published stud- 
ies (Chodavarapu etal. 2010; Bernal etal. 2012; Chodavarapu 
et al. 201 2; Foret et al. 201 2). We required a minimum of five 
reads to call the methylation state of a cytosine. The DNA 
methylation level was estimated from the fraction of cytosines 
that failed to undergo bisulfite conversion. Therefore, for each 
cytosine, the methylation level ranged from 0 to 1 . When we 
computed average gene-body methylation for a given con- 
text, we calculated the average methylation for all coding 
regions, using appropriate gene models for each organism. 
For H. sapiens we used H1 embryonic stem cell line methyla- 
tion profile. The distributions of gene-body methylation levels 
are shown in fig. IB. 

GC 3 

For every open reading frame, GC 3 was computed as 
GC 3 = (C 3 +G 3 )/(/./3), where C 3 and G 3 are counts of cyto- 
sines and guanines in the third position of the codon and L is 
the length of the coding sequence. 

GC 3 distributions are obtained from a histogram of GC 3 
values, where GC 3 values were rounded to hundredths (figs. 2 
and 4) or tenth (figs. 5 and 6). We require that all points on the 
graph were supported by at least 100 observations, criteria 
which determined the choice of the bin size. 

Standardization of Gene Expression (Z-Statistic) 

For a gene g, Z g (Expression) = (E g - e)/a, where E g is the 
average expression of the gene g across N experiments, e is 
average expression of all genes, and a is the SD of gene ex- 
pression. All expression levels were log-transformed. The 
genes were divided into three groups based on their expres- 
sion level, namely Z g (Expression) > 1, -1 < Z g (Expression) 
< 1 , and Z g (Expression) < -1 . 

The genome signature (p CG ) is defined as the relative abun- 
dance of the frequency of dinucleotides in the genome, so 
that pcg = fcG/fcfe, where f x is the frequency of a (di) 



nucleotide. Genomes or genes can thus be compared with 
respect to their relative abundance of methylation targets and 
GC 3 richness. 

CG 3 -Skew 

Following (Tatarinova et al. 2003), CG 3 -skew was defined 
as CG 3skew = (C 3 - G 3 )/(C 3 +G 3 ). We calculated the 5'-3' 
CG 3 -skew gradient patterns in arabidopsis, rice, bee, fruit 
fly, and human by counting the number of Cs and Gs in the 
third position of codons in the first 200 codons of GC 3 -rich 
and GC 3 -poor genes. 

Expression Measures 

We use mean expression value across all collected experiment 
for every gene, SD of gene expression values across all condi- 
tions, and coefficient of variation (CV), defined as a ratio or SD 
and mean gene expression. 

Distinguishing GC 3 -Rich from GC 3 -Poor Genes 

Since GC 3 varies between organisms, such definitions are or- 
ganism-specific and depend on the shape of its distribution 
which can be either unimodal or bimodal (fig. ^A). In the case 
of unimodal bell-shaped distribution, common to many plant 
and animal species, the extreme 5% of the genes from the 
tails of the distributions are denoted as "GC 3 -rich" and "GC 3 - 
poor" genes (Sablok et al. 201 1 ; Ahmad et al. 201 3). By con- 
trast, for bimodal distributions that are common to grasses 
and homeotherm vertebrates (Elhaik et al. 2009; Elhaik and 
Tatarinova 2012), the GC 3 cutoff is determined based on the 
position of the "valley" between the two peaks. 

Gene Ontology Annotation 

Gene ontology (GO) annotations were obtained from www. 
geneontology.org (last accessed January 15, 2013), TAIR 
(www.arabidopsis.org, last accessed December 6, 2012), 
and Michigan State University (ftp.plantbiology.msu.edu, last 
accessed December 5, 2012). Upon division of genes into 
GC 3 -rich and -poor classes, we computed x 2 = (O - E) 2 /E 
statistic for each GO category (supplementary tables S4 and 
S5, Supplementary Material online). 

Results 

GC 3 , Body Methylation, and Gene Expression 

Of the guanine and cytosine (GC) content at each codon po- 
sition (GC 1f GC 2 , GC 3 ), the last measure represents the frac- 
tion of GC content in the codon's wobble position that has the 
most freedom to change without altering amino acid se- 
quence of the gene. GC 3 exhibits the strongest Pearson's cor- 
relation with gene-body methylation (r GC1 = -0.47, 
r GC2 = -0.35, r GC3 = -0.67) and variability of gene expres- 
sion (r GC1 =0.1, r GC2 = 0.14, r GC3 = 0.21) and is correlated 
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Fig. 1. — Distributions of GC 3 content for rice, Arabidopsis, bee, and human. 



with the gene's GC content (e.g., in rice correlation between 
genie GC and GC 3 is 0.94). 

Due to the different shapes of the GC 3 distributions in the 
studied taxa (fig. ^A) I we hypothesized that the GC 3 content 
has a regulatory role and should be correlated with both CpG 
methylation and gene expression which, in turn, should also 
be correlated with one another. To test our hypothesis, we 
carried detailed analyses of the relationship between GC 3 
composition, gene-body methylation, and gene expression 
in rice, arabidopsis, honey bee, and human. As expected, in 
all four species, GC 3 and genie CpG methylation were nega- 
tively correlated and CpG methylation had a consistently neg- 
ative effect on the variability of gene expression (table 1). The 
relationship between GC 3 and average gene expression is 
nonlinear and saddle-like for all four organisms (fig. 2), but 
the strength of the dependencies varies from organism to 
organism. 

We compared full and partial correlation coefficients, cal- 
culated as in Kim and Yi (2007), between GC 3 , gene expres- 
sion variability, and gene-body methylation (table 1). We 
found that the relationship between gene-body methylation 
and GC 3 is approximately the same, when controlling for var- 
iability of gene expression as compared to the full correlation 
coefficient. Partial correlations between gene expression var- 
iability and methylation and between GC 3 and gene 



expression variability are much smaller than the full correlation 
coefficients. These results suggest that the relationship be- 
tween GC 3 and gene-body methylation is the driving force 
and confounds the two other correlations. 

In the following sections, we describe the relationship be- 
tween GC 3 , gene-body methylation, and gene expression for 
each of the four organisms we investigated. 

Oryza sativa 

In rice, distributions of GC 3 and gene-body methylation are 
both clearly bimodal (fig. 1). Genes can be divided into GC 3 - 
rich and -poor classes using the position of the valley between 
the two peaks (at GC 3 « 0.8) and, similarly, into highly meth- 
ylated and lowly methylated classes (gene-body methyla- 
tion « 0.0178). We have previously shown (Tatarinova et al. 
2010) that GC 3 -rich genes in rice have more methylation tar- 
gets (pcg) that can be used to modulate tissue-specific expres- 
sion: pcG(poor) = 0.55 and p CG (rich) = 1.15. 

To estimate the regulatory effects of GC 3 we first calcu- 
lated its correlation with different genie measures including 
intron density, the number of introns per 1000 bases, and 
intron fraction, defined as the ratio of intron length to gene 
length, for GC 3 -poor and -rich genes that are highly and lowly 
expressed (table 2). Compared with lowly expressed genes, 
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Fig. 2. — GC 3 vs expression for four organisms: bee (green), rice (blue), Arabidopsis (red), and human (purple). (A) Relationship between standardized 
values of GC 3 and average expression. (B) Gene expression variability as a function of GC 3 . Every point represents a mean across at least 100 genes and the 
standard error of the mean does not exceed 0.1 (plot A) and 0.06 (plot B). 



Table 1 

Pearson's Correlation Coefficients between CpG Methylation, GC 3 , 
and Gene Expression Variability for Oryza sativa, Arabidopsis thaliana, 
Apis mellifera, and Homo sapiens 



Correlation between 


O. sativa 


A. thaliana 


Api. 


H. sapiens 








mellifera 




CpG methylation and GC 3 


-0.67 


-0.27 


-0.65 


-0.23 




-0.65 


-0.23 


-0.62 


-0.23 


CpG methylation and 


-0.18 


-0.18 


-0.24 


-0.02 


gene expression 


-0.06 


-0.13 


-0.04 


-0.06 


variability (CV) 










Gene expression 


0.21 


0.16 


0.34 


-0.16 


variability (CV) and GC 3 


0.12 


0.12 


0.22 


-0.16 



Note. — Top numbers in each cell represents Pearson's correlation coefficients 
and bottom numbers represent partial correlation coefficients. 



highly expressed genes have an intron density approximately 
twice as high; with both the average number of exons and 
average intron length being 1.5 times higher. Remarkably, 
genie measures for highly and lowly expressed genes varied 
markedly when compared between GC 3 -poor and -rich genes 
(table 2). For instance, GC 3 -poor genes with high (£> 1) and 
low (£<— 1) expression values differ in their intron density 
(6.296 and 3.090, respectively) and number of exons (9.60 
and 5.41, respectively). We also found that GC 3 is negatively 



associated with intron density (r=-0.36, P value < 0.0001) 
and with intron fraction (r= -0.40, P value < 0.0001). 

We found a significant association between methylation 
and GC 3 richness (table 3) in agreement with previous studies 
that described a positive correlation between GC 3 content 
and the variability of gene expression in grasses (Tatarinova 
et al. 2010). Studying the triangular relationship between 
methylation, gene expression, and GC 3 (figs. 2 and 3), we 
observe that GC 3 -rich genes tend to have more variable 
gene expression and lesser gene-body methylation levels 
than the GC 3 -poor genes. Moreover, methylation of CpG in 
coding regions has a nonlinear relationship with gene expres- 
sion. Both the most lowly and highly expressed genes have 
low levels of methylation while medium-expressed genes are 
more methylated, in agreement with the trends reported by 
Jjingo etal. (2012). These observations suggest the interplay of 
two or more forces that affect gene expression. GC 3 exhibits a 
trend from high GC 3 and low methylation to low GC 3 and 
high methylation. Highly methylated genes, associated with 
development, genomic imprinting, or silencing of transgenes, 
exhibit low expression levels. These results are consistent with 
the notion that methylated genes can undergo 5-methylcyto- 
sine deamination where mC^T. In such cases, the third po- 
sition can often undergo cytosine deamination reducing GC 3 
without affecting the protein sequence, whereas the first two 
nucleotides in the codon are less likely to mutate due to 
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Table 2 

Compactness of Rice Genes, Stratified by Expression and GC 3 



GC 3 


Exon 


Exons 


Intron Density 


Intron 


Intron Fraction 


Number 


Expression 




Length 




(per 1000nt) 


Length 


(Length) 


of ORFs 


(Standardized) 


GC 3 >0.800 


767 


2.47 


2.301 


1683 


62.4% 


428 


E> 1 




1132 


2.21 


1.132 


1085 


41.9% 


1215 


£<-1 


GC 3 < 0.491 


1503 


9.60 


6.296 


4249 


73.3% 


924 


E> 1 




1587 


5.41 


3.090 


3116 


60.1% 


386 


£<-1 



Table 3 



Four Classes of (n = 9,607) Rice Genes by GC 3 and Methylation 





GC 3 -Rich 


GC 3 -Poor 


High methylation 


289 


4787 


Low methylation 


3161 


1370 



Note.— Yates's % 2 = 4267.237. 



selective pressures to conserve amino acid sequences. Hence, 
methylated genes are expected to be GC 3 poor. 
Consequently, low-methylated genes have high GC 3 values 
and low average expression (fig. 3), where an increase of CpG 
methylation and high deamination rate lead to a drop in GC 3 
values; at the same time the average expression reaches the 
maximum for the broadly expressed genes. A further increase 
in methylation does not affect GC 3/ but rather reduces gene 
expression, leading to a repression of the gene (see supple- 
mentary materials and supplementary fig. S5, Supplementary 
Material online, for further details). 

To examine the effect of alternative splicing on the corre- 
lation between methylation and GC 3 , we next considered the 
relationship between GC 3 and gene-body methylation for 
intron-containing and intron-less genes. Lyko et al. (201 0) dis- 
covered that clusters of methylated cytosines are associated 
with alternatively spliced exons and that intron containing 
genes are more methylated than intron-less genes. Intron- 
less genes are, obviously, not subject to alternative splicing 
while genes with introns may be alternatively spliced. There 
are 2,648 intron-less genes in the dataset; for these the aver- 
age values of GQ = 0.63, GC 2 = 0.51 , and GC 3 = 0.77, com- 
pared with 6,959 intron-containing rice genes with average 
GC 1= 0.58, GC 2 = 0.47, and GC 3 = 0.61. Indeed, intron- 
containing genes are twice more methylated than intron- 
less genes (0.18 vs 0.09). As expected, intron-containing 
genes also exhibit a stronger positive relationship between 
the average methylation and expression, between the CV of 
gene expression and GC 3 , and stronger negative correlation 
between the CV of gene expression and methylation (table 4). 
Interestingly, we observed only a small difference in the cor- 
relations between the average methylation and GC 3 between 
intron-less (r= -0.6) and intron-containing (r= -0.67) 
genes. Therefore, splicing influences the relationship between 
methylation, expression, and nucleotide composition. 



Traditional microarray measurements, which ignore alter- 
native splicing, are not able to fully measure variability of gene 
expression. This may partially explain why when comparing 
intron-containing with intron-less genes, the first have higher 
average expression (1.41 vs 1.13, respectively) and lower CV 
of gene expression (0.92 vs 1 .28, respectively). We hypothe- 
size that apparent constitutive expression of hypermethylated, 
intron-containing genes can be a complex phenomenon, with 
different splicing forms expressed at different developmental 
stages, tissue types, and external conditions. We hypothesize 
that gene expression variability of hypomethylated, intron-less 
genes is achieved by transcriptional regulation. Overall, alter- 
native splicing evens may explain the differences in methyla- 
tion and expression levels between intron-less and intron- 
containing genes, but not the differences between GC 3 and 
gene-body methylation. 

A more general explanation of the relationship between 
gene expression and methylation involving the nucleosome 
was recently proposed by (Jjingo et al. 2012). The authors 
pointed out that CpG sites occur frequently across gene 
bodies and that in genes with low levels of expression, meth- 
ylation is prevented by dense nucleosome packing. By con- 
trast, in genes with average levels of expression these sites are 
accessible to DNMTs and hence are more likely to be methyl- 
ated. When expression is high, polymerases and DNMTs com- 
pete for the access to the same sites and hence methylation is 
suppressed again. 

Arabidopsis thaliana 

Arabidopsis has a narrow and unimodal distribution of GC 3 
and a bimodal distribution of methylation levels (fig. 1). 
Despite the apparent unimodality of the GC 3 distribution, 
Arabidopsis genes with GC 3 > 0.5 are significantly less meth- 
ylated than genes with GC 3 <0.5: P(methylation < 0.01 6| 
GC 3 > 0.5) = 0.72 and P(methylation < 0.01 6|GC 3 < 0.5) = 
0.33, suggesting a relationship between GC 3 and methyla- 
tion. More specifically, the increase of GC 3 composition 
is negatively correlated with gene-body methylation levels 
(fig. 4/\). Of the three methylation contexts, the most pro- 
nounced effect is observed for CpG methylation (r=-0.27, 
P value < 0.0001) (fig. 4), while CHG and CHH methylation 
levels appear to be less affected by GC 3 composition. 

In thale cress, the average relative abundance of the fre- 
quency of CG dinucleotide (genome signature, p C g) for all 
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Fig. 3. — Oryza sativa: Relationship between GC 3 (purple triangles), gene expression strength (blue diamonds), expression variability (red squares), and 
methylation. Standard error of the mean is below 0.03 (GC 3 ), 0.1 1 (expression), and 0.03 (expression variability). 



genes is 0.73. The relative abundance of methylation targets 
depends on GC 3 richness, for genes with GC 3 >0.5 (mean 
p CG = 0.91) and the remaining genes (average p CG = 0.71). 
There is also a relationship between methylation levels and 
PcG^PcG^ethylation < 0.01 6) = 0.84 while p CG (methylation 
>0.016) = 0.67. Hence, GC 3 -rich genes have more methyla- 
tion targets but are less methylated. Therefore, despite the 
unimodality of the GC 3 distribution in A thaliana, the relation- 
ship between methylation and GC 3 is similar to the pattern 
observed for rice. That is, like the other taxa, arabidopsis ex- 
hibits a nonlinear, saddle-like dependence, between the 
strength of gene expression and GC 3 (fig. 2A), but its gene 
expression variability grows almost linearly with GC 3 (fig. 25). 

To further study the relationship between tissue-specific 
gene-body methylation and tissue-specific expression, we 
next examined tissue-specific patterns across shoots and 
roots as these exhibit differences in morphology, gene expres- 
sion activity, and function. We investigated 1000 genes from 
the two tails of the log(shoots/roots) expression distribution 
(see Materials and Methods section) and compared the differ- 
ences between shoot and root body methylation levels for the 
two gene groups. We found that the average genie methyla- 
tion was similar for shoots and roots (0.063 in shoots vs 0.057 
in roots). However, for genes overexpressed in shoots, there 
was a negligible difference between shoot and root 



methylation, whereas for genes overexpressed in roots, on 
average, the "shoot" genes were 21% more methylated 
than the "root" genes (P value = 0.003). These results are 
again in agreement with Jjingo et al. (2012) and highlight 
the role of methylation in contributing to tissue-specific ex- 
pression. Interestingly, differences between methylation levels 
in shoots and roots increase with GC 3 for all methylation types 
(fig. 4B). 

In summary, GC 3 is positively correlated with both expres- 
sion variability and variation in genie methylation. There is also 
an inverse relationship between gene-body tissue-specific 
methylation and tissue-specific gene expression. 

Apis mellifera 

The GC 3 distribution of the European honey bee, Api. melli- 
fera, is a unimodal right skewed distribution with a long tail of 
high GC 3 values (fig. 1). The honey bee is a GC 3 -poor organ- 
ism, but it has a surprising medium and high GC 3 tail, 
containing approximately 25% of its genes with GC 3 >0.5. 
Based on the current annotation, 2.2% of all Api. mellifera 
genes encode receptors (such as Metabotropic glutamate re- 
ceptor, Toll-like receptor, Dopamine receptor type D2, D2-like 
dopamine receptor, Ephrin receptor, SIFamide receptor, 
Ecdysteroid receptor A isoform, Antennapedia protein, 
Nicotinic acetylcholine receptor alpha 1 subunit, Alpha- 
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glycosidase G-protein coupled receptor, and others). Genes 
with GC 3 > 0.505 are significantly enriched in receptor encod- 
ing genes, which account for 5.6% of these compared with 
1.3% in genes with GC 3 <0.12 (P value = 2.6E-7). The fre- 
quency of CG dinucleotides differ between GC 3 -rich genes 
(average pcg = 1-18) and GC 3 -poor genes (average 
p CG = 0.41). To further study the relationships between 
GC 3 richness, receptor genes, and methylation, we compared 
data from queen and worker bees. 

Queen and worker bees share the same genome but differ 
in size, appearance, and life span. While there is little differ- 
ence between the whole-genome methylation level of the 
worker and queen bees (around 1 % of cytosines in CpG con- 
texts are methylated in both), their genes differ significantly in 
methylation levels (fig. SA and B). This finding in is in agree- 
ment with a previous report that worker and queen bees differ 
in the methylation of approximately 550 genes (Lyko et al. 
2010). Lyko et al. (2010) also reported that unmethylated 
genes are enriched in receptors. The methylated genes 
encode proteins showing a higher degree of conservation 
than proteins encoded by non-methylated genes (Foret 
et al. 2009). Of the three methylation contexts, we observed 
that the average fraction of CG methylation per gene was 
associated with GC 3 composition (fig. 5C and D) in support 
of a putative GC 3 regulatory role. In other words, increases in 
GC 3 in bees are associated with a decrease in gene-body 
methylation levels, which are enriched for receptor encoding 
genes. Follow-up analyses of bee methylation patterns can be 
found elsewhere (Lyko et al. 2010; Foret et al. 2012). 

In addition to these relationships, we also found that dif- 
ferences between methylation levels in worker and queen 
bees depend on the nucleotide composition of coding re- 
gions. We analyzed the relationship between gene body 
methylation and GC 3 for queen and worker bees. The relative 
difference between gene-body methylation in queen and 
worker bees, defined as (0 - W)/(Q+W), depends on the 
methylation context (CpG, CHH, or CHG) (fig. SA). The rela- 
tive difference in CpG and in total methylation is low for the 
GC 3 -poor genes and increases substantially when GC 3 
approaches 0.4 after which methylation stays roughly the 
same for genes with GC 3 > 0.4 (fig. SA). Relative difference 
between CHH and CHG methylation decreases with the in- 
crease of GC 3 . The transition between the compositional en- 
vironments may be related to changes in the regulatory role of 
each region. The difference between CpG methylation levels 
between queen and worker is negative for low GC 3 genes 
(queen bee is less methylated) and becomes positive with an 
increase of GC 3 (fig. SB). Overall, GC 3 poor genes are more 
methylated than GC 3 -rich genes (fig. 5Cand D). As compared 
with the worker bees, queen bees have lower body methyla- 
tion levels for GC 3 -poor class (enriched for ubiquitously ex- 
pressed genes) and higher for GC 3 -rich class (enriched for 
receptor-encoding genes) (fig. SB). Since queen and worker 
bees play drastically different roles in the beehive, they activate 
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and rely onto different sets of genes (Aamodt 2009). Higher 
social role of the queen bee may require more elaborate in- 
teraction with environment, which necessitates more regula- 
tion of the GC 3 -rich receptor-encoding genes through 
methylation. Our observations agree with Foret et al. (2009) 
and Elango et al. (2009), who pointed out that ubiquitously 
expressed critical genes are methylated at the germ-line, while 
cast-specific genes lack methylation. Caste-specific genes 
remain unmethylated to allow for greater epigenetic flexibility 
and regulatory control (Elango et al. 2009). Greater degree of 
flexibility is important for certain classes of genes in other in- 
vertebrates: according to Gavery and Roberts (2010) and 
Roberts and Gavery (201 2), the ubiquitously expressed house- 
keeping genes tend to be hypermethylated while tissue-spe- 
cific and inducible genes are hypomethylated. 

Homo sapiens 

Coding regions of H. sapiens have a broad bimodal distribu- 
tion of GC 3 values (fig. \A) and a unimodal distribution of 
genie methylation levels, with a long tail toward low methyl- 
ation levels (fig. 1 B) (Chodavarapu et al. 201 0). As in the other 
three species, the relative abundance of CpG dinucleotides 
differs for GC 3 -rich and -poor genes: p C G(nch) = 0.68 and 
PcG(poor) = 0.29. Overall, the H. sapiens genome is more 



methylated than bee, rice, and arabidopsis (fig. IB). 
Although the nonlinear dependence between GC 3 and 
gene expression is apparent (table 1, fig. 2A and B), its 
shape differs compared with the other three species we ana- 
lyzed. In human, CpG methylation is negatively correlated 
with GC 3 and CHH and has no significant correlation with 
CHG methylation (table 1 and fig. 6). The weak correlation 
between GC 3 , expression, and methylation suggests the exis- 
tence of other evolutionary forces affecting gene expression in 
the human genome. 

The Compositional Environment and Gene-Body 
Methylation Paradox 

A pronounced pattern that emerged from all our analyses is 
that GC 3 -rich genes are, on average, undermethylated, 
despite their enrichment of CpG dinucleotides. To further il- 
lustrate this trend, we compared the GC 3 gradient (supple- 
mentary fig. S1 , Supplementary Material online) and the CG 3 - 
skew (supplementary fig. S2, Supplementary Material online) 
across all tested taxa with gradients of methylation levels using 
the same groups of GC 3 -rich and GC 3 -poor genes (supple- 
mentary fig. S3, Supplementary Material online). The positive 
5 / -3 / gradient of body methylation, where methylation in- 
creases toward the mid-portion of the transcribed part of 
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Fig. 5. — Apis mellifera: (A) Relative difference in gene-body methylation levels (Q - 1/1/)/ (0+1/1/) as a function of GC 3 between worker and queen bee. 
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the gene can be attributed to a gene experiencing "boundary 
effects" from the attachment of transcriptional and transla- 
tional machinery. At the 5 r -end methylation needs to be low 
to enable attachment of proteins. Deamination of methylated 
cytosines in broadly expressed and highly methylated GC 3 - 
poor genes leads to the decrease in C nucleotides and nega- 
tive CG 3 -skew in the middle of the gene (supplementary fig. 
S2, Supplementary Material online). Although GC 3 -rich genes 
are enriched in methylation targets, they are undermethylated 
compared with GC 3 -poor genes. In fact, GC 3 -rich genes were 
so hypomethylated that we had to log-transform the methyl- 
ation levels to be able to plot the two trends on the same 
figure. Additional evidence of the different regulatory roles 
GC 3 -poor and GC 3 -rich genes assume in methylation can be 
found by looking at the competing process of cytosine deam- 
ination reducing methylation targets. 

GC 3 -rich and GC 3 -poor genes exhibit different body meth- 
ylation levels and different gradients of methylation in coding 
regions (see Supplementary Materials and supplementary 
figs. S1-S3, Supplementary Material online). The variation in 
compositional gradients may explain the under methylation 
observed in GC 3 -rich genes. Methylation level of GC 3 -poor 
genes experiences steep growth in the first 100 codons (300 



nucleotides) and then stays approximately constant (supple- 
mentary fig. S3, Supplementary Material online). With the 
exception of H. sapiens H1 cell line, methylation levels of 
GC 3 -rich genes are position-independent. As shown by 
Tatarinova et al. (2010) and by Sablok et al. (201 1), towards 
the middle of the gene, GC 3 -rich genes continuously become 
more C-rich (positive CG 3 -skew), whereas GC 3 -poor genes 
become G-rich (negative CG-skew); GC 3 -rich genes become 
even more GC 3 rich towards the middle of the gene, and GC 3 - 
poor genes become more GC 3 poor. We hypothesize that for 
the broadly expressed and highly methylated GC 3 -poor genes, 
the decrease in C nucleotides may be due to cytosine deam- 
ination (mC^T transitions). 

To this end, we next looked at genes of Drosophila mela- 
nogaster, which belongs to the so-called "Dnmt2 only" or- 
ganisms that do not contain any of the canonical DNA 
methyltransferases (Dnmtl and Dnmt3) (Krauss and Reuter 
201 1). The levels of DNA methylation in the fruit fly are sig- 
nificantly lower than in other organisms (Lyko et al. 2000). In 
the fruit fly, GC 3 content is positively associated with strength 
of gene expression (supplementary fig. S4D, Supplementary 
Material online). For the 300 genes with GC 3 < 0.55, average 
expression across 71 conditions is 2.12 on the log 10 scale, 
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versus average expression of 3.62 for the 300 genes with 
GC 3 > 0.8. Surprisingly, variability of fruit fly gene expression 
does not seem to be affected by GC 3 . In addition, average 
genome signatures, for both GC 3 -rich and -poor fly genes are 
even (pcg= 0.9). 

We compared the 5' to 3' gradients of CG skew in bee, 
thale cress, rice, and human (where a significant degree of 
gene-body methylation exists) with those in the fruit fly. In the 
first four taxa, we observed drastically different 5 r to 3 r gradi- 
ents of CG skew in both GC 3 -rich and GC 3 -poor genes (sup- 
plementary fig. S2, Supplementary Material online), whereas 
in the fruit fly these trends are absent (supplementary fig. S4C, 
Supplementary Material online). Decreased methylation of 5' 
regions was previously described by Roberts and Gavery 
(2012). In other words, the unmethylated fly genes exhibit 
similar GC 3 5 / -3 / gradients (supplementary fig. S4B, Supple- 
mentary Material online) to those of the other taxa. However, 
due to the absence of cytosine deamination there are even 
levels of Cs and Gs for both fly GC 3 -rich and -poor genes, 
whereas in the other taxa cytosine deamination reduces the 
number of Cs for the highly methylated GC 3 -poor genes in a 
position-specific manner (supplementary fig. S2, Supplemen- 
tary Material online). 

Discussion 

Gene-body methylation and gene expression exhibit complex 
relationships with one another and with sequence composi- 
tion. For example, DNA methylation in coding and non-coding 
regions have opposite effects on gene expression: in 



promoters, cytosine methylation often makes transcription 
factor binding sites inaccessible to transcription factors and 
is responsible for transcriptional repression in A thaliana 
(Chan et al. 2005) while gene-body methylation is reported 
to be positively correlated with gene expression in H. sapiens 
(Hellman and Chess 2007). Generally, these relationships ex- 
hibit similarity across diverse taxa, but may vary for particular 
genes. For example, Aceituno et al. (2008) noted that in 
A thaliana housekeeping genes that have broad and steady 
expression levels were more body-methylated than expected 
based on whole-genome methylation levels (P=1.5E-35). 
Only 8% of the hypervariable genes, such as stress response 
or tissue specific genes with high values of gene expression 
coefficient of variation, were found to be body-methylated. 
Aceituno et al. (2008) also reported that gene body methyla- 
tion is negatively correlated (r= -0.89) with the variability of 
gene expression on a genome-wide scale, implying that 
housekeeping genes having low expression variability have 
higher methylation levels and vice versa. This report follows 
Bird et al.'s (1995) hypothesis that gene-body methylation 
could be responsible for the repression of spurious transcrip- 
tion within genes and hence lead to more reliable transcrip- 
tion, which results in a positive correlation between gene 
expression and gene-body methylation. This relationship was 
previously described as exhibiting a bell-shaped distribution 
(Zilberman et al. 2007; Zemach et al. 2010). 

To better understand the regulatory role of gene-body 
methylation and its relationship with sequence composition, 
we studied the role of GC 3 in four taxa: rice, thale cress, bee, 
and human. We showed that GC 3 richness and methylation 



Genome Biol. Evol. 5(8): 1443-1 456. doi:10.1093/gbe/evt103 Advance Access publication July 5, 2013 



1453 



Tata ri nova et al. 



GBE 



are negatively correlated, which leads to a seeming paradox: if 
GC 3 -rich genes are enriched in methylation targets, why are 
they undermethylated compared with GC 3 -poor genes? One 
reason for this negative correlation may be due to the preva- 
lence of ubiquitously expressed genes in the GC 3 -poor 
class that use body methylation as one of the mechanisms 
to maintain broad expression. Association between alternative 
splicing, gene expression, and methylation allows us to hy- 
pothesize that the alternatively spliced intron-containing 
genes and oppositely, the intron-less achieve gene expression 
variability via different mechanisms. Hypomethylation of 
intron-less, high GC 3 genes and abundance of methylation 
targets allows achieving higher regulatory control. Hyperme- 
thylated, intron-containing, low GC 3 genes can express differ- 
ent spicing forms and be expressed at different developmental 
stages, tissue types, and external conditions. It is thus not 
surprising that GC 3 -rich, hypomethylated genes have higher 
genetic diversity as compared with the GC 3 -poor, hypermeth- 
ylated genes (Tatarinova et al. 201 0; Lyko et al. 201 0; Roberts 
and Gavery 2012). 

We propose that the opposite effects of methylation and 
compositional gradients along CDS of GC 3 -poor and GC 3 -rich 
genes (supplementary fig. S3, Supplementary Material online) 
are the products of two or more competing processes. The 
first driver is transcriptional efficiency. There may be a "uni- 
versal pressure" to increase the fraction of C-ending codons 
from the 5 r to the 3' end of the gene that can be explained by 
the need to increase the speed of transcription in this direc- 
tion. This is especially important for stress-specific genes (that 
are frequently GC 3 -rich) (Tatarinova et al. 201 0), since they are 
expressed as a response to a certain environmental condition, 
likely at a high level, for a limited amount of time resulting in a 
large number of RNA polymerases (RNAPs) that move simul- 
taneously along the same track. Hence, it is necessary to avoid 
RNAP congestion and increase the speed of transcription. 
There is no such pressure for ubiquitously expressed genes 
(frequently GC 3 -poor), since RNAP congestion effects are 
not likely to occur. 

The competing process may be cytosine deamination, 
which affects more methylated genes and genes that are ex- 
pressed at relatively constant levels across tissues. GC 3 -rich 
genes are less methylated and are likely to have limited 
tissue-specific and stress-specific expression patterns that re- 
quire less time in the transcriptional bubble. Therefore, the 
effect of cytosine deamination is less pronounced in GC 3 - 
rich genes. For GC 3 -rich genes, transcriptional kinetics is the 
winning driver. 

Takuno and Gaut (2012) hypothesized that "body-methyl- 
ated genes would be both longer and more functionally im- 
portant than unmethylated genes." The authors suggested 
that methylation has a functional role, such as maintaining 
transcriptional accuracy and splicing efficiency, thus explaining 
why the GC 3 -poor housekeeping genes are overall highly 
methylated. This agrees with our findings (table 2) that GC 3 - 



poor genes are longer (e.g., in rice, GC 3 -rich genes are on 
average 1031 nt long and GC 3 -poor genes are on average 
1648nt long) and have more exons (e.g., in rice, GC 3 -rich 
genes have on average 2.38 exons and GC 3 -poor genes 
have on average 8.57 exons). Takuno and Gaut (2012) also 
found that "body-methylated genes evolve more slowly than 
unmethylated genes, despite the potential for increased mu- 
tation rates in methylated CpG dinucleotides." This is also 
consistent with our observation (Tatarinova et al. 2010) of 
faster evolution of unmethylated GC 3 -rich genes as compared 
with methylated GC 3 -poor genes. Finally, we have shown that 
methylated genes have a lower proportion of CpG nucleo- 
tides, which supports the deamination hypothesis. 

Overall, our work supports and expands recent findings by 
Takuno and Gaut (2012) and Roberts and Gavery (2012). We 
propose several possible explanations to the question of why 
GC 3 -rich genes are enriched in CpG dinucleotides compared 
with GC 3 -poor genes: first, these sites may have played a 
regulatory role in the past and are maintained in the 
genome to allow phenotypic plasticity by increasing the 
number of transcriptional opportunities (Roberts and Gavery 
2012). Second, these sites may have an active regulatory role 
that has yet to be determined. Third, we suggest considering 
the problem from a different angle — that while GC 3 -poor 
genes have less CpG sites than GC 3 -rich genes, they are 
more body-methylated because as methylation increases in 
the 5 r ^3 r direction, there is more chance for mC^T muta- 
tion towards the middle of the gene. Most of the GC 3 -poor 
genes are ubiquitously expressed; therefore, the sense strand 
spends more time unprotected during transcription 
(Tatarinova et al. 2003). The cytosines are therefore lost in 
the deamination processes and the CG 3 -skew value is re- 
duced. Since the third position in the codon is not under pres- 
sure to conserve the protein sequence, the mC^T mutations 
are manifested as gene's GC 3 poorness. In support of this 
view, the 5' end of genes has a lower level of methylation 
and positive gradient of CG 3 -skew for both GC 3 -rich and 
GC 3 -poor genes, which can be explained by transcription/ 
translation initiation requirements. 

If methylation is associated with transcription, then the 
ubiquitously active genes should lose GC 3 due to deamination 
while the inducible ones should not. Looking at the gene-body 
methylation and GC 3 composition as a function of the nor- 
malized average gene expression in rice (supplementary fig. 
S5, Supplementary Material online), methylation and GC 3 
have opposing trends: where GC 3 increases, methylation de- 
creases and vice versa. Normalized gene expression between 
-1 and +1 contains many of the ubiquitously expressed 
genes, and in this region a decrease in GC 3 is accompanied 
by an increase in methylation. Methylation and GC 3 of induc- 
ible genes, having low average exprerssion (below -1 in sup- 
plementary fig. S5, Supplementary Material online) are not 
affected by the change in gene expression. 
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Our observation that the unmethylated fly genes exhibit 
similar GC 3 5 / -3 / gradients to those of the other taxa but dif- 
ferent patterns of CG 3 -skew supports the significance of cy- 
tosine deamination. In the fruit fly, due to the absence of 
cytosine deamination, levels of Cs and Gs for both GC 3 -rich 
and -poor genes are approximately the same, whereas in the 
other taxa cytosine deamination reduces the number of Cs for 
the highly methylated GC 3 -poor genes. 

We note that in addition to the processes described here, 
there are two major forces affecting GC 3 . One is GC-biased 
gene conversion (BGC) (Duret 2008), which is common to all 
our model species (Duret and Arndt 2008; Duret and Galtier 
2009; Katzman et al. 201 1 ; Muyle et al. 201 1 ; Gunther et al. 
201 2; Kent et al. 201 2). The other is selection on codon usage, 
which has been shown to occur in Arabidopsis (Muyle et al. 
201 1 ; Gunther et al. 201 2). It has been suggested that recom- 
bination hotspots can create strong substitution hotspots that 
are correlated with gene density that drive the evolution of GC 
content (Duret and Arndt 2008; Tatarinova et al. 2010). 
Affecting both coding and non-coding regions, BGC may 
lead to enrichment in GC content in genomic regions of 
high recombination compared with regions of low recombi- 
nation and may explain the patterns observed in human. 
Coding regions may also be susceptible to codon usage bias 
that directly affects the frequency of GC 3 . The complex inter- 
play between these forces and their relative effect on meth- 
ylation and gene expression in different species remains 
unclear and provides a fertile area for future studies. 

Conclusions 

We report strong negative correlations between CpG meth- 
ylation and the GC 3 content of genes in rice, bees, 
Arabidopsis, and humans. We propose several explanations 
for the triangular relationship between GC 3 , methylation, 
and expression patterns. The negative correlation between 
GC 3 and methylation can be explained by the prevalence of 
ubiquitously expressed genes in the GC 3 -poor class that use 
body methylation as one of the mechanisms to maintain 
broad expression. Positive 5 / -3 / gradient of body methylation, 
where methylation levels rise toward the mid-portion of the 
transcribed part of the gene, can be attributed to a gene 
experiencing "boundary effects" from the attachment of 
transcriptional and translational machinery. We propose that 
the opposite effects of methylation and compositional gradi- 
ents along CDS of GC 3 -poor and GC 3 -rich genes are the prod- 
ucts of two or more competing processes. The first driver is 
transcriptional efficiency. The competing process may be cy- 
tosine deamination, which affects more methylated genes 
and genes that are expressed at relatively constant levels 
across tissues. GC 3 -rich genes may be enriched in CpG dinu- 
cleotides as compared with GC 3 -poor genes for a number of 
reasons: firstly, these sites may have played a regulatory role in 
the past and are maintained in the genome to allow 



phenotypic plasticity. Secondly, these sites may have an 
active regulatory role that has yet to be determined. Thirdly, 
cytosine deamination may reduce the frequency of CpG di- 
nucleotides in ubiquitously expressed (GC 3 -poor) genes. 

Supplementary Material 

Supplementary material S1-S4, tables S1-S5 and figures 
S1-S6 are available at Genome Biology and Evolution online 
(http://www.gbe.oxfordjournals.org/). 
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